# A 65-nm 10-Gb/s 10-mm On-Chip Serial Link Featuring a Digital-Intensive Time-Based Decision Feedback Equalizer

Po-Wei Chiu, Somnath Kundu, Member, IEEE, Qianying Tang, Student Member, IEEE, and Chris H. Kim, Senior Member, IEEE

Abstract—A digital-intensive on-chip serial link achieving a 10 Gb/s data rate over a 10-mm interconnect was demonstrated in a 65-nm GP process. A three-tap half-rate feed-forward equalizer was implemented for signal pre-emphasis in the transmitter (TX) block. On the receiver (RX) side, a two-tap half-rate time-based decision feedback equalizer was employed to cancel out inter-symbol interference noise. A  $2^{15}-1$  pseudorandom binary sequence generator and an in situ bit error rate (BER) monitor were designed for bit stream generation and convenient eye-diagram measurements. The measured energy efficiency of the TX and RX was 31.9 and 45.3 fJ/b/mm, respectively, for a data rate of 10 Gb/s. A BER less than  $10^{-12}$  was verified for an eye width of 0.43 unit interval.

Index Terms—Digital intensive, eye diagram, feed-forward equalizer (FFE), in situ bit error rate (BER) monitor, pseudorandom binary sequence (PRBS), time-based decision feedback equalizer (TB-DFE).

#### I. INTRODUCTION

ON-CHIP data buses are performance-critical circuits in modern processor systems as they are responsible for transferring massive amounts of data between various processing units and cache blocks at gigahertz frequencies. Despite the aggressive pace of transistor scaling, interconnect speed and interconnect power have not scaled proportionally due to large *RC* parasitics, large die size, increasing number of processing units, and higher operating frequencies [1]–[6]. As shown in Fig. 1, when the bit period becomes shorter than the *RC* time constant of an interconnect wire, significant inter-symbol interference (ISI) may occur. Meanwhile, due to the longer interconnect length and high operating frequency, interconnect power has become a significant portion of the total chip power consumption.

Manuscript received August 5, 2017; revised October 10, 2017 and November 5, 2017; accepted November 7, 2017. Date of publication December 6, 2017; date of current version March 23, 2018. This paper was approved by Guest Editor Ken Chang. This work was supported in part by the National Science Foundation Award under Grant CCF-1255937 and in part by the Semiconductor Research Corporation under Grant 2013-HJ-2409. (Corresponding author: Po-Wei Chiu.)

P.-W. Chiu and C. H. Kim are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: chiux148@umn.edu).

S. Kundu was with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA. He is now with the Wireless Communication Research Lab, Intel Labs, Hillsboro, OR 97124 USA.

Q. Tang was with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA. She is now with the Central Hardware Engineering Division, Huawei Technologies, Shenzhen 518129, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2017.2774276

A simple solution to the interconnect bottleneck is to add repeater circuits that break up a long wire into shorter segments, which improves the overall latency and throughput of the channel [7], [8]. This method is effective and relatively straightforward to implement as synthesis tools support automated buffer insertion. However, tools may not be able to place repeaters at their desired locations due to large functional blocks underneath the interconnect path. Signals may have to be rerouted or the chip floorplan may have to be disrupted to accommodate the repeaters. This may result in additional design time, loss in performance, and increased power consumption due to redundant repeaters [9].

Recently, serial links have been gaining popularity for on-chip point-to-point applications as they can achieve 10 Gb/s or higher data rates with high power efficiency without disrupting the chip floorplan. Many of them employ low-swing signaling which can lower the power consumption but requires sophisticated transmitter (TX) and receiver (RX) circuits. Equalization techniques such as feed-forward equalizer (FFE), continuous-time linear equalizer (CTLE), and decision feedback equalizer (DFE) have been widely used for off-chip serial link applications, as shown in Fig. 2 [10]-[13]. FFE [12], [13] is a technique implemented in the TX block to pre-distort the signal to compensate for the channel loss. CTLE and DFE are implemented in the RX block. CTLE is basically an amplifier which provides peaking gain at the signal frequency of interest [8], [10]. DFE, on the other hand, is used to cancel ISI noise in the incoming data stream by subtracting the ISI component estimated based on the preceding bits and proper weights [11]-[13]. The number of preceding bits and the weight values for the DFE filter are determined by the channel characteristics. Generally speaking, a channel with high loss requires a longer DFE filter. Some recent works have proposed using equalization techniques to improve the communication speed and energy efficiency of on-chip serial links. For example, a charge-injection-based FFE was proposed in [14] where capacitive coupling was used to pre-distort the TX signal [15], [16]. In [17], a current mode transceiver with a pre-emphasis driver and an active inductor was demonstrated. However, these on-chip serial links incorporate a complex analog-intensive design style which suffers from headroom issues and process-voltage-temperature effects.



Fig. 1. Repeater circuits can improve the latency and throughput of on-chip interconnects.



Fig. 2. Equalization techniques such as FFE, CTLE, and DFE have proven to be effective for high-speed IO applications.

Moreover, they do not take full advantage of the technology scaling benefits and require considerable re-design effort in every new technology. There has not been any report of DFE applied to on-chip links due to the complicated circuit design and large power overhead of current mode logic (CML).

In an effort to make DFE more digital-friendly and amenable to technology scaling, we propose a time-based DFE (TB-DFE) technique where the DFE operation is performed entirely in the time domain using digital circuits. Time-based circuits for DFE implementation were first proposed in [18] for off-chip links. However, that design was primarily based on analog-intensive circuits. In this paper, we demonstrate a digital-intensive design of time-based DFE utilizing inverters and digitally controlled delay elements which are readily available in advanced technologies. Another key advantage of the proposed TB-DFE is that a higher number of taps can be incorporated by simply adding more delay stages without affecting the DFE throughput. Our work does not leverage this property as on-chip interconnects typically do not require more than two to three taps. However, we think TB-DFE can be a promising candidate for off-chip links or body channel communication applications where a large number of taps is preferred.

The remainder of this paper is organized as follows. Section II describes the conventional voltage mode DFE for comparison purposes. The proposed time-based DFE is described in Section III. Implementation details of a 10-mm on-chip serial link with a time-based DFE are given in Section IV. Section V provides details of the *in situ* bit error rate (BER) monitor used to measure the time-domain BER eye diagram. Measurement results are discussed in Section VI. Finally, conclusions are drawn in Section VII. The conference version of this paper was published in [19].



Fig. 3. Conventional DFE consisting of a CML summer and slicer circuit. DFE filter weights ( $W_1$  and  $W_2$ ) are incorporated into the CML tail current.



Fig. 4. Simulated unity gain frequency versus number of DFE taps for a conventional CML-based DFE circuit. Bandwidth decreases with higher number of DFE taps.

#### II. CONVENTIONAL DFE ARCHITECTURE

DFEs have become indispensable for off-chip links; however, they have not been adopted in on-chip links due to their design complexity and area/power overhead. For better understanding of the proposed time-based approach, we first describe the basic operation of the conventional DFE. Voltage-based DFE filter operation can be expressed as  $V_{\rm DFE} = V_{\rm RX}(t) + \sum_i x[n-i] \cdot w_i$  where  $V_{\rm RX}(t)$  is the incoming analog voltage,  $x[n-i]$  is the ith preceding bit, and  $w_i$  is the corresponding weight. As shown in Fig. 3, the summer and slicer circuits of the conventional design are implemented using CML. The weight of each DFE filter tap corresponds to the CML tail current. The operation principle is as follows. The analog voltage signal  $V_{RX}(t)$  determines the pull-down current of the first CML stage. The pull-down currents of the other CML stages are determined by the preceding data coupled with the appropriate weights  $W_1, W_2, \dots, W_N$ . By subtracting the weighted sum of the preceding bits from the original analog signal, the ISI noise component can be cancelled out.
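The filter expression above can be illustrated with a short behavioral sketch. It is not a model of the CML circuit in Fig. 3: the toy channel `isi_channel`, its cursor and post-cursor amplitudes, the mapping of bits to ±1 symbols, and the zero slicing threshold are all assumptions made for illustration. The weights are chosen with negative sign so that adding the weighted sum removes the post-cursor ISI.

```python
def isi_channel(bits, cursor=1.0, post=(0.6, 0.5)):
    """Toy channel: each +/-1 symbol leaks into the following samples.

    The cursor and post-cursor amplitudes are hypothetical values.
    """
    symbols = [1 if b else -1 for b in bits]
    samples = []
    for n, s in enumerate(symbols):
        v = cursor * s
        for i, p in enumerate(post, start=1):
            if n - i >= 0:
                v += p * symbols[n - i]
        samples.append(v)
    return samples


def voltage_dfe(v_rx, weights, threshold=0.0):
    """Slice each sample after adding the weighted previous decisions."""
    history = [0] * len(weights)  # x[n-1], x[n-2], ... as +/-1 (0 = unknown)
    decisions = []
    for v in v_rx:
        # V_DFE = V_RX + sum_i x[n-i] * w_i
        v_dfe = v + sum(x * w for x, w in zip(history, weights))
        bit = 1 if v_dfe > threshold else 0
        decisions.append(bit)
        history = ([1 if bit else -1] + history)[:len(weights)]
    return decisions


tx_bits = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
v_rx = isi_channel(tx_bits)
no_dfe = voltage_dfe(v_rx, weights=[0.0, 0.0])
with_dfe = voltage_dfe(v_rx, weights=[-0.6, -0.5])
print("TX      :", tx_bits)
print("no DFE  :", no_dfe)
print("with DFE:", with_dfe)
```

With these assumed amplitudes the third bit is misread without feedback, because the two preceding "1" symbols push the sample above the threshold, while the two-tap feedback recovers the transmitted pattern.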

CML is a natural fit for DFE implementation. However, on-chip serial link applications cannot afford such



Fig. 5. Proposed time-based DFE architecture with two inverter-based delay lines and a PD. DFE weights control the delay of each inverter stage.



Fig. 6. High-level operating principle of the TB-DFE. The delay difference between the main delay line and the reference delay line determines the output bit.



Fig. 7. Maximum operating frequency and jitter for each TB-DFE output stage. Transient noise option was turned ON to account for the increased jitter in the later delay stages. Simulation results show that the operating frequency remains relatively constant even for a high number of DFE taps.

analog-intensive circuit solutions as they have to be integrated in a complex processor. Another fundamental limitation of CML-based DFE implementation is the throughput loss as the number of taps increases. This is because the parasitic capacitance increases linearly with the number of CML stages connected to the output node. In addition, the common mode voltage changes with the number of CML branches, affecting the gain and linearity of the summer circuit. Complex biasing circuits are needed to mitigate this issue, which makes DFE less



Fig. 8. Illustration of TB-DFE for a data pattern of "0100." Timing margin is enhanced by the TB-DFE circuit.

attractive for on-chip link applications. To further understand the performance limitations of CML, we simulated the unity gain frequency  $f_T$  for a CML circuit with different numbers of pull-down branches. As shown in Fig. 3, the total parasitic capacitance on node  $V_{\text{DFE}}$  is equal to  $C_{\text{Total}} = C_M + C_{T1} + \cdots + C_{TN}$ , where  $C_M$  denotes the main tap capacitance and  $C_{TN}$  denotes the Nth post tap capacitance. Fig. 4 shows the simulation results of  $f_T$  versus number of taps assuming an  $I_{\text{BIAS}}$  of 2 mA. The parasitic capacitance of each post tap was assumed to be 10% of the main tap capacitance. From the simulation results, we can see that increasing the number of taps degrades  $f_T$  and limits the maximum throughput of the DFE filter. The bandwidth degradation is caused not only by the increased parasitic capacitance but also by the reduced current of the main stage. The reduced main stage current also affects the common mode voltage level, necessitating the redesign and resizing of the CML circuit.
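The capacitance sum above lends itself to a quick first-order estimate. The sketch below assumes that bandwidth scales as $1/C_{\text{Total}}$ at fixed transconductance, which is a simplification introduced here and not the simulation behind Fig. 4; in particular it ignores the reduced main stage current mentioned above, so it understates the degradation. The function names are hypothetical.

```python
def total_capacitance(c_main, n_taps, post_tap_ratio=0.10):
    """C_Total = C_M + C_T1 + ... + C_TN with every post tap at a fixed ratio of C_M."""
    return c_main * (1.0 + n_taps * post_tap_ratio)


def relative_bandwidth(n_taps, post_tap_ratio=0.10):
    """First-order bandwidth relative to the zero-tap case (assumes f ~ 1 / C_Total)."""
    return 1.0 / (1.0 + n_taps * post_tap_ratio)


for n in (0, 2, 4, 8, 10):
    print(f"{n:2d} taps: C_Total = {total_capacitance(1.0, n):.1f} x C_M, "
          f"relative bandwidth = {relative_bandwidth(n):.2f}")
```

Under this assumption, ten post taps at 10% each double the load on the summing node and halve the first-order bandwidth.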

#### III. PROPOSED TIME-BASED DFE

#### A. Operation Principle of Time-Based DFE

In this section, we describe the proposed digital-intensive time-based DFE approach, where equalization is performed entirely in the time domain using digital circuits. The basic idea is to perform the time-domain version of the DFE filter expressed as  $T_{\text{DFE}} = T_{\text{RX}}(t) + \sum_i x[n-i] \cdot w_i$ . This can be achieved by replacing the CML summer and slicer circuit with an inverter-based delay line and phase detector (PD), as shown in Fig. 5. In conventional DFE, the weights are incorporated into the CML tail current while in TB-DFE, the weights are incorporated in the inverter delays. The high-level operation is



Fig. 9. Illustration of delay transformation technique. (a) Standard implementation requires six inverters. (b) Same operation [i.e.,  $sign(T_1 + T_2 + T_3 - 3T_{REF})$ ] can be performed using four inverters by folding inverter delay  $T_3$  to the lower path. For DFE implementations with a high number of taps, the total area and energy can be reduced by roughly 50% using this technique.



Fig. 10. Two-tap TB-DFE after applying the delay transformation technique described in Fig. 9.

as follows. As shown in Fig. 6, the input clock is fed to both the main delay line, which contains a voltage-to-time converter (VTC) stage, and the reference delay line. The incoming analog voltage signal is converted to an analog delay by the VTC stage. For instance, data "1" will result in a longer delay and vice versa. The VTC delay is then added to the delay of the later stages which are determined by the filter weights and previously sampled bits. The PD compares the delay of the main path with the delay of the reference path to generate the binary output. This is equivalent to the slicer operation in a conventional DFE, but in the time domain. The binary output is fed back to the DFE filter for the next cycle. The proposed time-based architecture has several unique advantages compared to the CML-based architecture such as good scalability, good low voltage operating margin, compact area, good tuning capability for process compensation, and no throughput loss for higher number of DFE taps.
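The operation described above can be sketched behaviorally with two delay lines and a comparison of their total delays. This is a minimal sketch under several assumptions that do not come from the paper: a 20-ps nominal stage delay, the toy ISI amplitudes in `isi_delays`, weights expressed as signed delay deviations in picoseconds, and a PD polarity in which a longer main-line delay is read as "1" (the real polarity depends on the inversions in the signal path).

```python
T_NOM_PS = 20.0  # hypothetical nominal delay of every stage, in ps


def isi_delays(bits, cursor_ps=3.0, post_ps=(2.0, 1.5)):
    """Toy VTC output: delay deviation (ps) including post-cursor ISI."""
    symbols = [1 if b else -1 for b in bits]
    out = []
    for n, s in enumerate(symbols):
        dt = cursor_ps * s  # data "1" gives a longer delay
        for i, p in enumerate(post_ps, start=1):
            if n - i >= 0:
                dt += p * symbols[n - i]
        out.append(dt)
    return out


def tb_dfe(vtc_dev_ps, weights_ps, t_nom_ps=T_NOM_PS):
    """Compare the main delay line against a reference line of nominal stages."""
    n_taps = len(weights_ps)
    t_ref_line = (1 + n_taps) * t_nom_ps  # VTC stage + one stage per tap
    history = [0] * n_taps                # previous decisions as +/-1
    decisions = []
    for dt in vtc_dev_ps:
        t_main = t_nom_ps + dt            # VTC stage
        for x, w in zip(history, weights_ps):
            t_main += t_nom_ps + x * w    # one weighted delay stage per tap
        # Phase detector: only the arrival-time difference matters.
        bit = 1 if t_main > t_ref_line else 0  # polarity is an assumption
        decisions.append(bit)
        history = ([1 if bit else -1] + history)[:n_taps]
    return decisions


tx_bits = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
vtc_dev = isi_delays(tx_bits)
print("no DFE  :", tb_dfe(vtc_dev, [0.0, 0.0]))
print("with DFE:", tb_dfe(vtc_dev, [-2.0, -1.5]))
```

Because the decision depends only on the difference between the two line delays, the nominal stage delay cancels out, which is the property that makes the approach tolerant to variations common to both lines.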

There is no throughput loss because the number of DFE taps can be increased by simply adding more delay stages. Adding more delay stages does not affect the parasitic capacitance of the individual stage and hence the



Fig. 11. (a) Circuit implementation of delay stage with analog control (i.e., VTC stage). (b) Delay change versus RX voltage.



Fig. 12. (a) Circuit implementation of delay stage with 6-bit digital control. (b) Delay change versus weight.

throughput remains constant regardless of the number of DFE taps. On the contrary, parasitic capacitance of the CML-based implementation increases linearly with the number of DFE taps. This degrades the DFE performance as shown in Fig. 4. The constant throughput behavior was verified by simulating the delay of the programmable delay stage. Two inverters were chained together to form a single non-inverting delay stage which is equivalent to a single DFE tap. The shift register in the feedback path does not contain any logic function and hence can operate faster than a single delay stage. To simulate the maximum operating frequency, we took the layout of a standard inverter chain and extracted the RC parasitic. Then we gradually increased the clock frequency until the inverter delay decreased by 0.2 ps. This corresponds to the frequency when the signal amplitude starts to degrade. Degradation in the signal amplitude has two detrimental effects. First, it will make the DFE delay compensation less precise. Second, the non-full swing signal will induce timing offset in the PD circuit. The maximum operating frequency was simulated while varying the number of delay stages (Fig. 7). For accurate timing simulations, random dynamic noise option was turned ON. This captures the impact of device noise introduced by the additional delay stages on clock jitter. In our simulation setup, the maximum noise frequency was set to be 5 times higher than the maximum operating frequency while the minimum noise frequency was set to be 1/(simulation time), which is 10 MHz. All other noise parameters were set to their default values. The baseline clock period without noise was set as 55 ps, which corresponds to a frequency of 18.2 GHz. To account for the slight increase in jitter in the later stages, we increased the clock period by two times the maximum jitter when measuring the maximum clock frequency of a particular stage. As seen in the simulated waveforms in Fig. 7, transient



Fig. 13. Block diagram of 65-nm test chip.



Fig. 14. Detailed implementation of three-tap half-rate FFE on TX side.

noise due to the additional delay stages results in only a modest increase in jitter and hence the maximum operating frequency remains relatively constant. For instance, only a 0.76 ps increase in clock period was necessary to account for transient noise effects in the ninth stage output. It is worth clarifying why the maximum operating frequency in Fig. 7 is significantly higher than the measured frequency (i.e., 10 Gb/s) from the actual test chip. This discrepancy is due to the minimum capacitance configuration used in the simulation as well as the higher parasitic capacitance of the fabricated chip and the specific BER criterion.
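As a quick arithmetic check of the numbers quoted above, the snippet below converts the 55-ps baseline period and the 0.76-ps period increase for the ninth stage into frequencies. It only restates values given in the text; the helper name `max_frequency_ghz` is hypothetical.

```python
def max_frequency_ghz(period_ps, jitter_margin_ps=0.0):
    """Clock frequency (GHz) after widening the period by a jitter margin (ps)."""
    return 1e3 / (period_ps + jitter_margin_ps)


f_base = max_frequency_ghz(55.0)          # noise-free baseline
f_stage9 = max_frequency_ghz(55.0, 0.76)  # ninth-stage output with margin
loss_pct = 100.0 * (f_base - f_stage9) / f_base
print(f"baseline: {f_base:.2f} GHz, ninth stage: {f_stage9:.2f} GHz, "
      f"loss: {loss_pct:.2f} %")
```

The period increase lowers the frequency from about 18.2 GHz to about 17.9 GHz, a loss of under 2%, which is consistent with the statement that the operating frequency remains relatively constant.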

Fig. 8 shows an example of the TB-DFE operation for a bit sequence of "0100." Due to the channel ISI, the original square waveform becomes a smoother waveform  $V_{\rm RX}(t)$  by the time it reaches the RX. The delay line signals before and after the DFE operation are shown in Fig. 8 (middle). The distorted voltage is converted to the corresponding time delay which contains ISI noise. This results in a reduced sensing margin between the DFE path signal  $N_{\rm DFE}$  and the reference path signal  $N_{\rm REF}$ . The reduced phase difference may lead to a bit error. TB-DFE utilizes the preceding bits to expand the phase difference, leading to a more reliable phase detection. The improvement in sensing window using the TB-DFE is illustrated in Fig. 8 (bottom).

#### B. Delay Transformation Technique

The power consumption and chip area of the TB-DFE can be further reduced by folding half of the DFE delay line to the reference delay line. Fig. 9 shows a simplified example of a three-stage TB-DFE with the proposed delay transformation technique. Delays  $T_1$ ,  $T_2$ , and  $T_3$  of the upper delay line are controlled by the current input signal, preceding bits, and the DFE weights. The delay of the lower reference delay line is fixed. The PD compares the two delays,  $T_1 + T_2 + T_3$  and  $3T_{\text{REF}}$ , to determine the output bit. If we move  $T_3$  to the lower delay line but with a negative polarity and positive delay offset (i.e.,  $2T_{\text{REF}} - T_3$ ) as shown in Fig. 9(b), the phase comparison result remains the same as the original implementation. That is, the same DFE function can be realized with fewer delay stages, which translates into a lower power consumption and smaller chip area. For long delay lines, the power and area can be cut down by almost 50% using this technique. An important point to note is that all delay stages including the one denoted by  $2T_{\text{REF}} - T_3$  can be implemented using the exact same circuit. This is possible because the delay ranges for  $T_1$ ,  $T_2$ , and  $2T_{\text{REF}} - T_3$  are the same when the nominal values of  $T_1$ ,  $T_2$ , and  $T_3$  are equal to  $T_{\text{REF}}$ . This can be explained using the following example. Suppose delays  $T_1$ ,  $T_2$ , and  $T_3$  are all centered around  $T_{\text{REF}}$  with a programmable delay range of  $\pm \Delta T$ :

$$\begin{aligned} \max(T_1, T_2, T_3) &= T_{\text{REF}} + \Delta T \\ \min(T_1, T_2, T_3) &= T_{\text{REF}} - \Delta T. \end{aligned}$$

Then the range of  $2T_{REF} - T_3$  can be calculated as follows:

$$\begin{aligned} \max(2T_{\text{REF}} - T_3) &= 2T_{\text{REF}} - \min(T_3) \\ &= 2T_{\text{REF}} - (T_{\text{REF}} - \Delta T) = T_{\text{REF}} + \Delta T \\ \min(2T_{\text{REF}} - T_3) &= 2T_{\text{REF}} - \max(T_3) \\ &= 2T_{\text{REF}} - (T_{\text{REF}} + \Delta T) = T_{\text{REF}} - \Delta T. \end{aligned}$$

As shown in these simple equations, the new delay stage (i.e.,  $2T_{\text{REF}} - T_3$ ) has the same delay range as the original delay stages  $T_1$ ,  $T_2$, and  $T_3$ . This allows us to utilize the



Fig. 15. Detailed implementation of RX block including TIA and two-tap half-rate TB-DFE.



Fig. 16. Zero-offset aperture PD circuit adopted in this paper [21].

same circuit for all delay stages which ensures a uniform layout with minimum delay mismatch. The output of the PD circuit is determined by the arrival time difference between the two inputs rather than the absolute delay value of each delay line. So the TB-DFE is inherently tolerant to voltage and temperature drifts or noise affecting both delay lines equally. Another subtle but important point is that with the proposed delay transformation, the number of delay stages required for an N-tap DFE is just N/2. In contrast, an N-tap DFE operation using CML requires N (not N/2) pull-down paths. So, when comparing the results in Figs. 4 and 7, we must account for the  $2\times$  difference in the number of taps. For instance, an eight-tap delay in Fig. 4 corresponds to a four-tap delay in Fig. 7, and so on. This makes TB-DFE even more attractive compared to CML-based DFE for large N values.
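The equivalence between the original and folded comparisons, and the matching delay range of the folded stage, can be checked exhaustively over a small grid. The sketch below uses integer delays in arbitrary units so that the comparison is exact; the values of `T_REF` and `DELTA` are hypothetical.

```python
from itertools import product

T_REF = 20  # hypothetical nominal stage delay (arbitrary integer units)
DELTA = 5   # hypothetical programmable range, +/- DELTA


def decide_original(t1, t2, t3, t_ref=T_REF):
    """Fig. 9(a): compare T1 + T2 + T3 against 3 * T_REF (3 + 3 stages)."""
    return (t1 + t2 + t3) > 3 * t_ref


def decide_folded(t1, t2, t3, t_ref=T_REF):
    """Fig. 9(b): T3 is folded into the lower line as 2 * T_REF - T3 (2 + 2 stages)."""
    upper = t1 + t2
    lower = t_ref + (2 * t_ref - t3)
    return upper > lower


delays = range(T_REF - DELTA, T_REF + DELTA + 1)
mismatches = sum(
    decide_original(a, b, c) != decide_folded(a, b, c)
    for a, b, c in product(delays, repeat=3)
)
folded_values = [2 * T_REF - t3 for t3 in delays]
print("mismatching decisions:", mismatches)
print("folded stage range:", min(folded_values), "to", max(folded_values))
```

Both comparisons reduce to the sign of $T_1 + T_2 + T_3 - 3T_{\text{REF}}$, so no combination of delays produces a different decision, and the folded stage spans the same range as the stages it replaces.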

Implementation of a two-tap TB-DFE using the proposed delay transformation technique is shown in Fig. 10. One of the two digitally controlled delay stages is folded into the reference delay path with a complementary weight, where it serves as a DFE tap. The first delay stage in each line is an analog-controlled delay stage serving as a VTC. The incoming analog signal from the channel is connected to the VTC of the upper delay line. The input voltage to the bottom VTC is hardwired to half VDD. The symmetric configuration between the upper and lower delay lines minimizes layout mismatch and common mode noise issues.

#### C. Analog and Digital Delay Stages

Fig. 11(a) shows the detailed implementation of the analog delay control stage which consists of four parallel tri-state inverters (including one always-ON inverter) and three MOS capacitors. The control signals of the parallel inverters and



Fig. 17. Zero-offset aperture PD simulation results.

capacitors are directly connected to the incoming analog voltage to maximize delay range. Fig. 11(b) shows the delay change versus input RX voltage relationship. Non-linearity of the VTC transfer curve will affect the TB-DFE operation since the voltage-to-delay conversion will be distorted. To get around this issue, we utilized the linear portion of the VTC transfer curve (i.e., from 0.5 to 1.1 V) by appropriately sizing the inverter's pull-up and pull-down devices. As shown in Fig. 11(b), the delay range corresponding to this input voltage range is 10 ps. Fig. 12(a) shows the detailed implementation of the digital delay control stage for positive delay compensation. The circuit is almost identical to the analog delay stage except that the capacitor is connected to VDD rather than ground for a wider delay range. The delay is controlled by the feedback data D and a 6-bit weight w<5:0>. Signal D determines whether or not the weight is applied to the delay stage. The 6-bit weight determines how many inverters and capacitors are turned ON. The implementation of negative delay compensation is identical, except that a complementary digital code is used. Fig. 12(b) shows the difference between the actual delay and the reference delay for different weight configurations. To achieve both positive and negative compensation, the delay of the reference line is set near the average between the D=0 and D=1 delays. With a 6-bit control, we were able to program the weights to their desired values with sufficient accuracy. Hence, non-linearity of



Fig. 18. (a) Conventional BER eye diagram is plotted in the voltage-versus-time domain. (b) Time-domain BER eye diagram is plotted in the time-versus-time domain.



Fig. 19. In situ bathtub and BER eye-diagram measurement circuits.

the delay curve is not a critical concern for proper TB-DFE operation.
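The role of the feedback bit D, the 6-bit weight, and the midpoint reference can be summarized in a small behavioral sketch. The linear delay step `LSB_PS` and the base delay `T_BASE_PS` are hypothetical; the actual delay curve in Fig. 12(b) is non-linear and its values are not reproduced here.

```python
LSB_PS = 0.25     # hypothetical delay step per weight code
T_BASE_PS = 20.0  # hypothetical stage delay with the weight not applied


def stage_delay_ps(d, w):
    """Delay of one digitally controlled stage for feedback bit d and 6-bit code w."""
    if not 0 <= w <= 63:
        raise ValueError("w must fit in 6 bits")
    return T_BASE_PS + (w * LSB_PS if d else 0.0)


def compensation_ps(d, w):
    """Stage delay relative to a reference set midway between the D=0 and D=1 delays."""
    t_ref = 0.5 * (stage_delay_ps(0, w) + stage_delay_ps(1, w))
    return stage_delay_ps(d, w) - t_ref


for code in (0, 16, 40, 63):
    print(f"w={code:2d}: D=0 -> {compensation_ps(0, code):+.3f} ps, "
          f"D=1 -> {compensation_ps(1, code):+.3f} ps")
```

Placing the reference at the midpoint makes the compensation symmetric: the same code yields a positive offset for D=1 and an equal negative offset for D=0.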

#### IV. CIRCUIT DESIGN

Fig. 13 shows the overall block diagram of the 10-mm on-chip transceiver system we implemented in a 65-nm GP process. It contains an on-chip clock generator which can provide a 5-GHz differential clock for TX and RX circuits. Channel data are generated by an on-chip PRBS circuit. A three-tap half-rate FFE was implemented in the TX block to de-emphasize the output signal and transmit the data over a 10 mm by 2  $\mu$m M9 channel, where 2  $\mu$m is the minimum width of the M9 metal layer in this technology. An inverter-based transimpedance amplifier (TIA) followed by a two-tap half-rate TB-DFE was designed for the RX block. On-chip monitoring circuits for in situ bathtub and BER eye-diagram measurements were included. A two-tap DFE was found to be sufficient for a 10-mm channel for our target frequency. This was also verified by experimental data shown in Section VI. For on-chip serial link applications, the system clock is available everywhere inside the chip so that there is no need for separate clock and data recovery circuits. BER data in Section VI capture non-ideal effects such as clock jitter as we have implemented the clock generator inside the test chip.

#### A. Transmitter

Fig. 14 shows the detailed implementation of the three-tap half-rate FFE. The data stream was generated by two independent PRBS units each operating at 5 Gb/s. The two bit streams were fed to two separate data paths. Different PRBS algorithms were implemented to ensure good randomness in the combined 10-Gb/s data. The differential 5-GHz clock samples the two data streams in an alternating manner, achieving half-rate operation. Three flip-flops drive the pre-cursor, main-cursor, and post-cursor taps, respectively, to support a three-tap FFE operation. The multiplexer after the third flip-flop stage combines the two data streams into a single 10-Gb/s data stream. A voltage mode output driver operating at 10 Gb/s is implemented using a bank of inverters and two shared output resistors for TX impedance matching [20]. The 4-bit FFE weights determine the de-emphasis level.
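The data path described above, two half-rate streams multiplexed into one full-rate stream and shaped by a three-tap filter, can be sketched behaviorally as follows. The tap coefficients are hypothetical normalized values chosen for illustration; the paper does not list the 4-bit weight settings, and the sketch does not model the driver circuit.

```python
def interleave(even_bits, odd_bits):
    """Combine two half-rate streams into one full-rate stream (2:1 multiplexer)."""
    out = []
    for e, o in zip(even_bits, odd_bits):
        out += [e, o]
    return out


def ffe_3tap(bits, c_pre=-0.125, c_main=0.75, c_post=-0.125):
    """Three-tap FFE: y[n] = c_pre*s[n+1] + c_main*s[n] + c_post*s[n-1], s in +/-1."""
    s = [1 if b else -1 for b in bits]
    out = []
    for n in range(len(s)):
        pre = s[n + 1] if n + 1 < len(s) else 0  # next bit (pre-cursor tap)
        post = s[n - 1] if n >= 1 else 0         # previous bit (post-cursor tap)
        out.append(c_pre * pre + c_main * s[n] + c_post * post)
    return out


stream = interleave([0, 0, 1, 1], [1, 1, 1, 0])
print("10-Gb/s stream:", stream)
print("TX levels     :", ffe_3tap(stream))
```

With these assumed coefficients an isolated bit is driven at full swing, while a bit surrounded by identical neighbors is driven at half swing, which is the de-emphasis behavior the weights are meant to control.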

#### B. Receiver

Details of the RX are shown in Fig. 15. An inverter-based TIA with a 4-bit programmable transmission gate bank ensures good RX impedance matching. The simulated bandwidth of the TIA ranged from 4.2 to 9.9 GHz depending on the transmission gate configuration. A differential 5-GHz clock drives two parallel DFE paths, achieving a combined data rate of 10 Gb/s. After the TIA, the voltage signal is converted to time delay  $T_{\rm RX}$  by the VTC.  $T_{\rm REF}$  is tunable in our test chip and was fixed to roughly the average of the data "1" and "0" delays. The second delay stage is a digital delay stage controlled by the feedback data and the appropriate weights. With the delay transformation technique described in Section III-B, the second tap can be moved to the bottom delay line. A third programmable delay stage was added for testing purposes (i.e., eye-diagram measurements). We will discuss time-domain eye-diagram measurements in Section V.

A zero-offset aperture PD, shown in Fig. 16 and composed of a set-reset latch and a flip-flop, was adopted in our design [21]. Simulation results of the zero-offset aperture PD are shown in Fig. 17. The *x*-axis represents the arrival time difference between the two input signals and the *y*-axis represents the delay between the early input signal and the PD output. When the upper input signal arrives earlier, the flip-flop will sample a "1," and vice versa. If the two



Fig. 20. Measured BER bathtub curves ( $<10^{-10}$ ) for one-tap and two-tap DFEs. BER is plotted for two consecutive bits. For our target application, a two-tap DFE offers marginal improvement in BER compared to a one-tap DFE.



Fig. 21. Measured BER bathtub curves ( $<10^{-12}$ ) with and without TB-DFE for two consecutive bits.

inputs fall within the aperture time window, then the PD delay increases, which can eventually result in a metastable output response. This is not a concern in our design as the input phase difference is sufficiently large compared to the aperture time window.
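The PD decision rule described above can be written as a three-way outcome. This is only a behavioral abstraction: the aperture width used below is a hypothetical placeholder, and returning `None` stands in for the slow or metastable response rather than modeling it.

```python
def phase_detect(t_upper_ps, t_lower_ps, aperture_ps=1.0):
    """Return 1 if the upper input arrives earlier, 0 if later.

    Returns None when the arrival-time difference falls inside the
    (hypothetical) aperture window, where the output may be metastable.
    """
    diff = t_lower_ps - t_upper_ps  # positive: upper input arrives first
    if abs(diff) < aperture_ps:
        return None
    return 1 if diff > 0 else 0


print(phase_detect(10.0, 15.0))  # upper input early
print(phase_detect(15.0, 10.0))  # upper input late
print(phase_detect(10.0, 10.2))  # inside the aperture window
```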

#### V. IN SITU BER EYE-DIAGRAM MEASUREMENTS

High-speed serial links are traditionally characterized using off-chip equipment such as a BER tester, a gigahertz clock generator, and high-frequency probes/cables. High-speed serial link measurements can be easily corrupted by any non-ideal connection between the test equipment and the IO pad. Ever-increasing clock frequencies and rising test costs have motivated designers to adopt on-chip BER measurement solutions. This testing approach has several advantages over off-chip equipment-based testing such as simpler setup, lower cost, ease of test automation, higher resolution, and reduced noise. This is particularly true for on-chip serial links as they are embedded deeply inside a processor chip with no connection



Fig. 22. Measured BER eye diagram for two consecutive bits.



Fig. 23. Supply voltage versus data rate and energy efficiency.

to the outside world. In this paper, we designed *in situ* BER and eye-diagram measurement circuits as part of the 65-nm test chip. Before we discuss the detailed implementation of the measurement circuit, we first introduce the concept of a "time domain" eye diagram. Fig. 18(a) shows a typical BER eye diagram for voltage mode circuits, which is obtained by sweeping the sampling time point (*x*-axis) and the offset voltage (*y*-axis). Such a voltage-versus-time BER eye diagram does not apply to our proposed TB-DFE as the voltage is immediately converted to time. So instead, we propose a time-domain BER eye diagram where the *y*-axis is the delay offset, as shown in Fig. 18(b). The delay offset was implemented using a separate delay stage in the delay lines.

Fig. 19 displays the overall block diagram of the *in situ* BER and eye-diagram measurement circuit. The leftmost box denoted by "Phase Delay" is used to sweep the x-axis. This programmable phase delay allows the clock to sample data over a 2 unit interval (UI) range allowing BER eye-diagram measurement across two cycles. The box denoted by " $\Delta T$ ," which is the third stage in the delay line, is for the delay offset representing the y-axis. Each programmable delay has



| Parameter         | Value                                |
|-------------------|--------------------------------------|
| Technology        | 65-nm GP CMOS                        |
| Core Size         | TX: 23 µm × 24 µm; RX: 30 µm × 59 µm |
| VDD               | 1.2 V                                |
| Data Rate         | 10 Gb/s                              |
| Channel Length    | 10 mm                                |
| BER               | < 10<sup>-12</sup>                   |
| Throughput        | 2 Gb/s/µm                            |
| Energy Efficiency | TX: 31.9 fJ/b/mm; RX: 45.3 fJ/b/mm   |

Fig. 24. Die photograph and feature summary table.

|                                   | ISSCC'09 [14]            | ISSCC'10 [15]       | ISSCC'12 [16]                  | ISSCC'13 [17]               | VLSI'15 [8]             | This work                    |
|-----------------------------------|--------------------------|---------------------|--------------------------------|-----------------------------|-------------------------|------------------------------|
| Technology                        | 90nm                     | 90nm                | 65nm                           | 65nm                        | 65nm                    | 65nm                         |
| TX and RX                         | Charge Injection FFE+TIA | Capacitively driven | Capacitively driven+sense amp. | Current mode<br>transceiver | CTLE-based repeater     | Voltage mode<br>driver+TIA   |
| Features                          | No DFE                   | No DFE              | No DFE                         | No DFE                      | No DFE                  | 2-tap TB-DFE                 |
| Data Rate                         | 4Gb/s                    | 4.9Gb/s             | 10Gb/s                         | 3Gb/s                       | 4Gb/s                   | 10Gb/s                       |
| Throughput<br>(Gb/s/µm)           | 2                        | 4.4                 | 2.56                           | 0.75                        | 4                       | 2                            |
| Link Length                       | 10mm                     | 5mm                 | 6mm                            | 10mm                        | 2.5mm+2.5mm             | 10mm                         |
| BER Bathtub                       | < 10E-6                  | < 10E-10            | < 10E-12                       | < 10E-12                    | < 10E-12                | < 10E-12                     |
| BER Eye                           | Yes (< 10E-6)            | No                  | No                             | Yes (< 10E-12)              | No                      | Yes (< 10E-11)               |
| Eye Width                         | 0.5UI*<br>@BER=10E-6     | N/A                 | N/A                            | 0.48UI*<br>@BER=10E-12      | 0.48UI**<br>@BER=10E-12 | 0.43UI**<br>@BER=10E-12      |
| Energy<br>Efficiency<br>(fJ/b/mm) | 35.6                     | 68                  | 174                            | 9.5                         | 48.4                    | TIA 14.4  DFE 30.9  FFE 31.9 |

\*Eye width \*\*Bathtub width

Fig. 25. Comparison with previous on-chip serial links.

a 6-bit control. The BER monitor compares the data from the  $2^{15}-1$  PRBS data D with the DFE output data D' using a two-input XOR gate. The error signal increments the 11-bit error counter. Finally, the error count is periodically read out for a given xy configuration and the BER is computed based on the total number of cycles and the error count. By sweeping phase delay and time offset, we can obtain the BER bathtub and BER eye diagram.
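The XOR-and-count behavior of the BER monitor can be illustrated with a short software model. This is only a sketch under stated assumptions: the PRBS polynomial ($x^{15}+x^{14}+1$, a common choice for a $2^{15}-1$ sequence) and the saturating behavior of the 11-bit counter are not given in the text, and the names `prbs15` and `measure_ber` are hypothetical.

```python
from itertools import islice


def prbs15(seed=0x7FFF):
    """Yield bits of a 2^15-1 PRBS.

    Assumption: polynomial x^15 + x^14 + 1 (a common PRBS15 choice);
    the text does not state which polynomial the chip uses.
    """
    state = seed & 0x7FFF
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    while True:
        # XOR of register stages 15 and 14 feeds stage 1.
        new_bit = ((state >> 14) ^ (state >> 13)) & 1
        state = ((state << 1) | new_bit) & 0x7FFF
        yield new_bit


def measure_ber(ref_bits, rx_bits, counter_bits=11):
    """Compare reference bits D with received bits D' and count mismatches.

    Assumption: the error counter saturates at 2^counter_bits - 1 rather
    than wrapping; the text only says that it is 11 bits wide.
    """
    max_count = (1 << counter_bits) - 1
    errors = 0
    cycles = 0
    for d, d_prime in zip(ref_bits, rx_bits):
        cycles += 1
        if (d ^ d_prime) and errors < max_count:  # two-input XOR flags an error
            errors += 1
    if cycles == 0:
        raise ValueError("no bits were compared")
    return errors, cycles, errors / cycles


# Hypothetical usage: flip five bits of an otherwise clean stream.
n_bits = 100_000
reference = list(islice(prbs15(), n_bits))
received = reference.copy()
for position in (10, 500, 9_999, 40_000, 99_999):
    received[position] ^= 1

errors, cycles, ber = measure_ber(reference, received)
print(errors, cycles, ber)
```

In a model of the full measurement, this comparison would be repeated once per phase-delay and time-offset code, with the counter read out and cleared between points.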

## VI. MEASUREMENT RESULTS

In this section, we present the bathtub and time-domain BER eye diagrams measured using the proposed on-chip circuits. Fig. 20 shows the BER bathtub while sweeping the phase delay for one-tap and two-tap DFE configurations. BER down to $10^{-10}$ was measured. No noticeable improvement in BER was seen by increasing the DFE length from one to two taps. We therefore conclude that, for the 10-mm channel implemented in our test chip and at our target frequency, a one-tap DFE is enough to remove ISI noise. Fig. 21 shows the bathtub curves with and without TB-DFE down to a BER of $10^{-12}$. Without the TB-DFE, the lowest BER we could achieve was only $10^{-10}$. After applying the TB-DFE, a BER less than $10^{-12}$ can be achieved while maintaining an eye width of 0.43 UI. Fig. 22 shows the time-domain BER eye diagram for two consecutive bits. The time offset can be controlled with 6-bit precision, so 64 codes are shown on the y-axis. To save test time, BER down to $10^{-11}$ (not $10^{-12}$) was measured for the eye diagram. Results show that a BER less than $10^{-11}$ was achieved for an eye width of 0.5 UI. Fig. 23 displays the energy efficiency and data rate measured at different supply voltages. The purpose of Fig. 23 is to verify good low-voltage performance of the TB-DFE. The data point at 1.2 V is based on a BER criterion of $10^{-12}$, while the other data points are for a BER of $10^{-9}$ due to test time limitations.
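To see why relaxing the BER target from $10^{-12}$ to $10^{-11}$ saves test time, the sketch below estimates how long one measurement point takes at 10 Gb/s. It relies on the standard zero-error bound $N = -\ln(1-\mathrm{CL})/\mathrm{BER}$. The 95% confidence level and the 64 × 64 sweep grid are assumptions made for illustration; the text states neither, and the function names are hypothetical.

```python
import math


def bits_for_zero_error_test(ber_target, confidence=0.95):
    """Error-free bits needed to claim BER < ber_target at the given confidence.

    Uses P(zero errors in N bits) ~= exp(-N * BER), so N = -ln(1 - CL) / BER.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / ber_target


def measurement_time_s(ber_target, data_rate_bps=10e9, confidence=0.95):
    """Time to observe the required number of bits at the given data rate."""
    return bits_for_zero_error_test(ber_target, confidence) / data_rate_bps


# Assumed sweep size: every combination of two 6-bit codes (hypothetical).
GRID_POINTS = 64 * 64

for target in (1e-11, 1e-12):
    per_point = measurement_time_s(target)
    total_hours = per_point * GRID_POINTS / 3600.0
    print(f"BER < {target:.0e}: {per_point:.0f} s per point, "
          f"{total_hours:.0f} h for {GRID_POINTS} points")
```

Under these assumptions, a point takes roughly 30 s at a $10^{-11}$ target and roughly 300 s at $10^{-12}$, so the tenfold difference is multiplied across every point of a two-dimensional eye sweep.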

The die photograph and summary table are shown in Fig. 24. The TX and RX blocks occupy a chip area of  $23 \times 24~\mu\text{m}^2$  and  $30 \times 59~\mu\text{m}^2$ , respectively. Note that the RX area includes test circuits such as the delay offset stage, which occupies about 1/3 of the RX circuit area, so the actual circuit area will be significantly smaller. The throughput per channel is 2 Gb/s/ $\mu$ m for a data rate of 10 Gb/s and a channel length of 10 mm. The energy efficiencies of the TX and RX (not including the BER monitor) blocks are 31.9 and 45.3 fJ/b/mm, respectively, at 1.2 V and 10 Gb/s. Fig. 25 compares the proposed design with other state-of-the-art on-chip links. This paper represents the first on-chip serial link with a time-based DFE utilizing digital-intensive circuits.
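The reported figures can be cross-checked with simple arithmetic, sketched below. Two assumptions are made that the text does not state: throughput density is taken to be the data rate divided by the per-channel routing pitch, and link power is taken as energy per bit times data rate, excluding the BER monitor. The function names are hypothetical; the TIA and DFE values come from the Fig. 25 table.

```python
import math

# Values quoted in the text and in Fig. 25.
DATA_RATE_GBPS = 10.0
CHANNEL_LENGTH_MM = 10.0
THROUGHPUT_GBPS_PER_UM = 2.0
TX_FJ_PER_B_MM = 31.9           # FFE (transmitter)
RX_TIA_FJ_PER_B_MM = 14.4       # receiver front end
RX_DFE_FJ_PER_B_MM = 30.9       # time-based DFE
RX_FJ_PER_B_MM = 45.3


def implied_pitch_um(data_rate_gbps, density_gbps_per_um):
    """Routing pitch per channel, assuming density = data rate / pitch."""
    return data_rate_gbps / density_gbps_per_um


def link_energy_fj_per_bit(tx_fj_b_mm, rx_fj_b_mm, length_mm):
    """Total TX + RX energy per bit over the full channel length."""
    return (tx_fj_b_mm + rx_fj_b_mm) * length_mm


def link_power_mw(energy_fj_per_bit, data_rate_gbps):
    """Power in mW: 1 fJ/b at 1 Gb/s equals 1 uW, i.e. 1e-3 mW."""
    return energy_fj_per_bit * data_rate_gbps * 1e-3


# The RX figure should equal the sum of its TIA and DFE parts.
assert math.isclose(RX_TIA_FJ_PER_B_MM + RX_DFE_FJ_PER_B_MM, RX_FJ_PER_B_MM)

pitch_um = implied_pitch_um(DATA_RATE_GBPS, THROUGHPUT_GBPS_PER_UM)
energy_fj = link_energy_fj_per_bit(TX_FJ_PER_B_MM, RX_FJ_PER_B_MM, CHANNEL_LENGTH_MM)
power_mw = link_power_mw(energy_fj, DATA_RATE_GBPS)
print(pitch_um, energy_fj, power_mw)
```

Under these assumptions, the numbers correspond to a 5-µm pitch per channel, 772 fJ per bit across the 10-mm link, and about 7.7 mW of combined TX and RX power at 10 Gb/s.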

# VII. CONCLUSION

In this paper, an inverter-based two-tap half-rate TB-DFE is demonstrated on a 10-mm on-chip serial link in a 65-nm GP process. Our proposed TB-DFE leverages digital-intensive circuits for good scalability, good low-voltage operation, low power, compact implementation, short design time, and digital programmability. More taps can be incorporated without incurring any throughput loss by simply adding more delay stages. This could be particularly beneficial for serial link applications requiring a longer DFE filter. The concept of a time-domain BER eye diagram was introduced along with *in situ* BER measurement circuits for reduced test effort and improved test accuracy, and circuit performance was verified using these circuits. Experimental data from the 65-nm test chip show that the proposed digital-intensive serial link can be a viable option for future on-chip interconnect applications.

## REFERENCES

- [1] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," *Proc. IEEE*, vol. 89, no. 4, pp. 490–504, Apr. 2001.
- [2] J. Cong, "An interconnect-centric design flow for nanometer technologies," *Proc. IEEE*, vol. 89, no. 4, pp. 505–528, Apr. 2001.
- [3] Y. I. Ismail and E. G. Friedman, "Effects of inductance on the propagation delay and repeater insertion in VLSI circuits," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 8, no. 2, pp. 195–206, Apr. 2000.
- [4] R. Bashirullah, W. Liu, and R. K. Cavin, "Current-mode signaling in deep submicrometer global interconnects," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 11, no. 3, pp. 406–417, Jun. 2003.
- [5] D. Schinkel, E. Mensink, E. A. M. Klumperink, E. van Tuijl, and B. Nauta, "A 3-Gb/s/ch transceiver for 10-mm uninterrupted RC-limited global on-chip interconnects," *IEEE J. Solid-State Circuits*, vol. 41, no. 1, pp. 297–306, Jan. 2006.
- [6] A. P. Jose, G. Patounakis, and K. L. Shepard, "Pulsed current-mode signaling for nearly speed-of-light intrachip communication," *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 772–780, Apr. 2006.
- [7] P. Larsson-Edefors, "Investigation on maximal throughput of a CMOS repeater chain," *IEEE Trans. Circuits Syst. I, Fundam. Theory Appl.*, vol. 47, no. 4, pp. 602–606, Apr. 2000.
- [8] M.-S. Chen, M.-C. F. Chang, and C.-K. K. Yang, "A low-PDP and low-area repeater using passive CTLE for on-chip interconnects," in *Proc. Symp. VLSI Circuits (VLSI Circuits)*, Kyoto, Japan, Jun. 2015, pp. C244–C245.
- [9] L. Zhang, J. M. Wilson, R. Bashirullah, L. Luo, J. Xu, and P. D. Franzon, "A 32-Gb/s on-chip bus with driver pre-emphasis signaling," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 17, no. 9, pp. 1267–1274, Sep. 2009.
- [10] S. Gondi, J. Lee, D. Takeuchi, and B. Razavi, "A 10 Gb/s CMOS adaptive equalizer for backplane applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 1. Feb. 2005, pp. 328–329.
- [11] V. Balan *et al.*, "A 4.8–6.4-Gb/s serial link for backplane applications using decision feedback equalization," *IEEE J. Solid-State Circuits*, vol. 40, no. 9, pp. 1957–1967, Sep. 2005.
- [12] K.-L. J. Wong and C.-K. K. Yang, "A serial-link transceiver with transition equalization," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2006, pp. 223–232.
- [13] H. Wang and J. Lee, "A 21-Gb/s 87-mW transceiver with FFE/DFE/analog equalizer in 65-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 909–920, Apr. 2010.
- [14] B. Kim and V. Stojanovic, "A 4 Gb/s/ch 356 fJ/b 10 mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2009, pp. 66–67 and 67a.
- [15] J. Seo, R. Ho, J. Lexau, M. Dayringer, D. Sylvester, and D. Blaauw, "High-bandwidth and low-energy on-chip signaling with adaptive preemphasis in 90 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2010, pp. 182–183.

- [16] D. Walter *et al.*, "A source-synchronous 90Gb/s capacitively driven serial on-chip link over 6 mm in 65 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 180–182.
- [17] S.-K. Lee, S.-H. Lee, D. Sylvester, D. Blaauw, and J. Y. Sim, "A 95 fJ/b current-mode transceiver for 10 mm on-chip interconnect," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 262–263.
- [18] I.-M. Yi *et al.*, "A time-based receiver with 2-tap DFE for a 12 Gb/s/pin single-ended transceiver of mobile DRAM interface in 0.8 V 65 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 400–401.
- [19] P.-W. Chiu, S. Kundu, Q. Tang, and C. H. Kim, "A 10 Gb/s 10 mm on-chip serial link in 65 nm CMOS featuring a half-rate time-based decision feedback equalizer," in *Proc. Symp. VLSI Circuits (VLSI Circuits)*, Kyoto, Japan, Jun. 2017, pp. C56–C57.
- [20] M. Kossel *et al.*, "A T-coil-enhanced 8.5 Gb/s high-swing SST transmitter in 65 nm bulk CMOS with < −16 dB return loss over 10 GHz bandwidth," *IEEE J. Solid-State Circuits*, vol. 43, no. 12, pp. 2905–2920, Dec. 2008.
- [21] S. Kundu, B. Kim, and C. H. Kim, "A 0.2–1.45-GHz subsampling fractional-N digital MDLL with zero-offset aperture PD-based spur cancellation and in situ static phase offset detection," *IEEE J. Solid-State Circuits*, vol. 52, no. 3, pp. 799–811, Mar. 2017.



Po-Wei Chiu was born in Tainan, Taiwan, in 1989. He received the B.S. and M.S. degrees in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2011 and 2013, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA.

His current research interests include high-speed analog and mixed-signal circuit design, such as high-speed serial link and optical I/O.



Somnath Kundu (S'13–M'16) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2008, the M.S. (Research) degree in electrical engineering from the Indian Institute of Technology, Delhi, India, in 2012, and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA.

From 2008 to 2012, he was an Analog Design Engineer with STMicroelectronics, Greater Noida, India, where he was involved in the transmitter, PLL, and bias design for different high-speed serial link IPs. He joined Xilinx, San Francisco Bay Area, CA, USA, as an Intern, in 2014, and Rambus, San Francisco Bay Area, CA, USA, as an Intern, in 2015. He also joined the Circuit Research Lab, Intel Labs, Hillsboro, OR, USA, as an Intern, in 2015. He is currently a Research Scientist with Wireless Communication Research Lab, Intel Labs. His current research interests include digital intensive mixed-signal circuits and radio-frequency integrated circuit design such as clock generators, wireless and wireline transceivers, analog-to-digital converters, and voltage regulators.

Dr. Kundu was a recipient of the Best Student Paper Award in the 2013 IEEE International Conference on VLSI Design.



Qianying Tang (S'13) received the B.E. degree in electrical engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2011, and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2017.

She joined the VLSI Research Lab, University of Minnesota, in 2012, with a focus on design for characterizing circuit reliability including random telegraph noise and radiation-induced soft errors, and circuit design for hardware security such as true random number generators and physical unclonable functions. She was an Intern with IBM, Fishkill, NY, USA, in 2015. She is currently an Engineer with the Central Hardware and Engineering Division, Huawei Technologies Co. Ltd., Shenzhen, China.

Dr. Tang received the Fellowship Award from the China Scholarship Council from 2011 to 2015.



**Chris H. Kim** (M'04–SM'10) received the B.S. and M.S. degrees from Seoul National University, Seoul, South Korea, and the Ph.D. degree from Purdue University, Lafayette, IN, USA.

He was with Intel Corporation, Hillsboro, OR, USA, where he performed research on variation-tolerant circuits, on-die leakage sensor design, and crosstalk noise analysis. He joined the Electrical and Computer Engineering Faculty, University of Minnesota, Minneapolis, MN, USA, in 2004, where he is currently a Professor. He has authored or co-authored over 200 journal and conference papers. His current research interests include digital, mixed-signal, and memory circuit design in silicon and non-silicon (organic TFT and spin) technologies.

Dr. Kim was a recipient of the SRC Technical Excellence Award, the Council of Graduate Students Outstanding Faculty Award, the NSF CAREER Award, a McKnight Foundation Land-Grant Professorship, the 3M Nontenured Faculty Award, DAC/ISSCC Student Design Contest Awards, the IBM Faculty Partnership Awards, the IEEE Circuits and Systems Society Outstanding Young Author Award, and the ISLPED Low Power Design Contest Awards. He has served as the Technical Program Committee Chair for the 2010 International Symposium on Low Power Electronics and Design.